Growing up in the Kansas City (Missouri) area, I learned that some streets had a reputation for being more "dangerous" than others. A few names that come to mind include The Paseo, Troost, Swope Park, and Prospect.
In addition to the "word on the street" that I have heard from others regarding where danger likely lurks, I draw upon anecdotal evidence from my experience walking various parts of downtown KC as a mail carrier at the Post Office. Some areas had streets with houses whose estimated property values would exceed the value of my statistical life, while a parallel street two blocks away would be lined with abandoned houses and closed businesses with windows decorated by bullet holes.
At the same time, it was not necessarily the case that an entire street was considered dangerous. The "Troost Avenue" that runs through the UMKC campus is the same "Troost" that might pass through an area known to be a hotspot for illegal activity. If my words alone are not enough, you can also explore the lyrical prose of KC-based artist Tech N9ne, and you will likely hear one or more of these streets named within a few songs as he explores topics including violence in the urban setting.
These thoughts led to questions about the validity of these notions. Are some streets really more dangerous than other streets? Are specific parts of certain streets more likely to see dangerous activity? Or are these reputations for notoriety nothing more than urban legend with no factual basis, and ought not be heeded?
These questions and more we shall explore in this analysis.
The data being used is a crowd-sourced attempt to collect and aggregate information regarding gun violence in the United States. To narrow the scope of interest of our analysis, the data is filtered to only include statistics of incidents occurring in Kansas City, Missouri (excluding observations from Kansas City, Kansas).
We must consider the fact that the validity of the conclusions we make throughout this analysis depends on the validity of the data itself. While proper cleaning, filtering, manipulation, and/or grouping of the data would help to avoid double-counting or excluding information, the validity of our assumptions is largely determined by the accuracy of the data collected and its closeness to the "universal truth" regarding the reality of gun violence across America.
For example, if there are incidents of gun violence that are not captured in this data, then our model will have bias. This could result from differences between the crowd-sourced information and the information that is reflected officially with police reports and/or in formal reports regarding crime statistics. This bias could also result from occurrences of gun violence that are not collected by any data-collecting agency because they go unreported or unrecognized.
Loading data and packages
pacman::p_load(
dplyr, # For syntax
stringr,# For sub-setting syntax
mapview,# For plotting lat/long coordinates
ggplot2,# For plotting more visuals
tidyr, # For more syntax
ggpubr, # For arranging and formatting plots
forcats # For a function
)
guns = read.csv("Gun_Violence_Archive.csv")
# Filtering for area of interest and flagging duplicate incident-IDs
dfg = guns %>%
filter(state == "Missouri" & city == "Kansas City") %>%
mutate(flag_interscetion = if_else(grepl(" and ", address),T,F),
flag_block = if_else(grepl(" block of ", address),T,F),
dup_id = duplicated(incidentid)) %>%
filter(dup_id == FALSE)
Next, we create a list of what we will consider the most dangerous streets, determined by the frequency of incidents grouped by street name.
# First working data of incidents NOT occurring at an intersection
dfg %>% filter(flag_interscetion == F) -> not_intersection
not_intersection$address = gsub(" block of ", " ", not_intersection$address)
not_intersection$address = gsub(" Block of ", " ", not_intersection$address)
not_intersection$address = gsub(" Street", " St", not_intersection$address)
not_intersection$address = gsub(" Avenue", " Ave", not_intersection$address)
not_intersection$address = gsub(" Boulevard", " Blvd", not_intersection$address)
# Changing names to cleaner reference
not_intersection = not_intersection %>%
tidyr::separate(., address, into = c("st_number", "st_name"), sep = "^\\S*\\K\\s+")
# Table featuring frequencies of incidents by unique street names
as.data.frame(table(not_intersection$st_name)) -> freq_not_int
# Next, working data of incidents that DO occur at an intersection
is_intersection = dfg %>% filter(flag_interscetion == T) %>%
tidyr::separate(., address, c("st_1","st_2"), sep = " and ", remove = F)
is_intersection$st_1 = gsub(" Street", " St", is_intersection$st_1)
is_intersection$st_1 = gsub(" Avenue", " Ave", is_intersection$st_1)
is_intersection$st_1 = gsub(" Boulevard", " Blvd", is_intersection$st_1)
is_intersection$st_2 = gsub(" Street", " St", is_intersection$st_2)
is_intersection$st_2 = gsub(" Avenue", " Ave", is_intersection$st_2)
is_intersection$st_2 = gsub(" Boulevard", " Blvd", is_intersection$st_2)
# Tables featuring frequencies of incidents by unique street names
# Split by "Street A" and "Street B" of the intersection of the incident
as.data.frame(table(is_intersection$st_1)) -> freq_is_int1
as.data.frame(table(is_intersection$st_2)) -> freq_is_int2
# Arranging frequency-lists in descending order and selecting top 10 values
freq_is_int1 %>% arrange(desc(Freq)) %>% top_n(10) -> tab1
freq_is_int2 %>% arrange(desc(Freq)) %>% top_n(10) -> tab2
freq_not_int %>% arrange(desc(Freq)) %>% top_n(10) -> tab3
# Combining all sets
rbind(tab1, tab2, tab3) %>%
arrange(desc(Freq)) %>%
top_n(10) -> temp
# Removing duplicated street names
temp = temp[!duplicated(temp$Var1), ]
# Frequency of incidents by street names
temp %>% head(.)
## Var1 Freq
## 1 Prospect Ave 31
## 2 Independence Ave 19
## 3 Benton Blvd 16
## 4 Troost Ave 15
## 5 The Paseo 14
## 7 Indiana Ave 13
Now that we have obtained a list of what we are considering to be the streets with the most occurrences of gun violence, we can take this list and return to our initial data set.
# Creating data frame for each individual street
temp_pros <- dfg[grep("Prospect", dfg$address), ]
temp_ind <- dfg[grep("Independence", dfg$address), ]
temp_ben <- dfg[grep("Benton B", dfg$address), ]
temp_troost <- dfg[grep("Troost", dfg$address), ]
temp_paseo <- dfg[grep("Paseo", dfg$address), ]
# Creating variable `street_name` for plain name for each observation
temp_pros$street_name <- "Prospect"
temp_ind$street_name <- "Independence"
temp_ben$street_name <- "Benton Blvd"
temp_troost$street_name <- "Troost"
temp_paseo$street_name <- "Paseo"
# Creating combined data with all streets of interest
rbind(temp_pros, temp_ind, temp_ben, temp_troost, temp_paseo) -> asdf
# Creating histogram showing distribution of lat/long
## Latitudes
ggplot(asdf, aes(x=latitude, fill=street_name)) +
geom_histogram(alpha=0.6, position='identity') +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank()
)-> p1
ggplot(asdf, aes(x=latitude, fill=street_name)) +
geom_histogram(alpha=0.6, position='identity') +
xlim(38.95,39.12)+
labs(subtitle = "Excluding Outliers") +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size=8, face = "italic")
)-> p2
ggarrange(p1,p2,common.legend = T) %>%
annotate_figure(.,
top = text_grob(
"Frequency Distribution of Latitudinal Coordinates by Street",
face = "bold.italic", size=14),
left = text_grob(
"Count of Occurrences",color = "black",rot = 90,face="bold"),
bottom = text_grob(
"Degrees Latitude", face = "bold"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).
## Warning: Removed 10 rows containing missing values (geom_bar).
## Longitudes
ggplot(asdf, aes(x=longitude, fill=street_name)) +
geom_histogram(alpha=0.6, position='identity')+
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank()
) -> p3
ggarrange(p3) %>%
annotate_figure(.,
top = text_grob(
"Frequency Distribution of Longitudinal Coordinates by Street",
face = "bold.italic", size=14),
left = text_grob(
"Count of Occurrences",color = "black",rot = 90,face="bold"),
bottom = text_grob(
"Degrees Longitude", face = "bold"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
By filtering the KCMO data to only include incidents that occur on the list of streets of interest, we can create the histograms above to show the frequency distribution of longitude/latitude coordinate ranges for each individual street.
Looking at the distribution of latitudes, we can see that all streets apart from Independence Ave are relatively uniformly distributed between 38.95 and 39.12. If we look at the histogram for longitudes, we can see that apart from Independence Ave, all streets are clustered around the same general longitude.
These trends are understandable when we think about the distributions intuitively, or with the visual aid of the points for each street plotted on the map below. Apart from Independence Ave, the streets run north/south, so it makes sense that each street is clustered around the same longitudinal coordinates. Similarly, Independence Ave being the exception in the distribution of latitudinal coordinates also makes sense: it runs east/west, so we would expect its points to cluster around the same latitude value.
mapview(asdf, label = TRUE,
xcol = "longitude", ycol = "latitude", crs = 4269,
zcol = "street_name", legend = T)
Now, considering the insight from understanding the frequency distributions of these selected streets and a visual representation, we can define what we will call bad areas. Within the scope of this analysis, a bad area is an area that lies within specified ranges of latitude/longitude coordinates and street names where we would expect higher rates of gun-related violence compared to other locations.
To formally define these parameters, we can look at the range of longitudinal values for streets running north/south and the latitudinal values for streets running east/west.
# Removing outliers as aforementioned -- total of 4 observations
asdf %>%
filter(latitude > 38.94 & latitude < 39.13) %>%
group_by(street_name) %>%
summarise(
"Range of Lat." = range(latitude),
"Range of Long."= range(longitude))
## # A tibble: 10 × 3
## # Groups: street_name [5]
## street_name `Range of Lat.` `Range of Long.`
## <chr> <dbl> <dbl>
## 1 Benton Blvd 39.0 -94.5
## 2 Benton Blvd 39.1 -94.5
## 3 Independence 39.1 -94.6
## 4 Independence 39.1 -94.5
## 5 Paseo 39.0 -94.6
## 6 Paseo 39.1 -94.6
## 7 Prospect 39.0 -94.6
## 8 Prospect 39.1 -94.6
## 9 Troost 39.0 -94.6
## 10 Troost 39.1 -94.6
Now we can use this information to formally define our parameters with an if statement; the following chunk of code goes through the data and flags observations that fit our definition of a bad_area.
bad_area = dfg %>% mutate(bad_geo = if_else(
(# Condition for BENTON
(latitude > 38.94 & latitude < 39.13) & (longitude > -94.5496 & longitude < -94.5432)) |
(# Condition for PASEO
(latitude > 38.94 & latitude < 39.13) & (longitude > -94.5704 & longitude < -94.5623)) |
(# Condition for PROSPECT
(latitude > 38.94 & latitude < 39.13) & (longitude > -94.5858 & longitude < -94.5508)) |
(# Condition for TROOST
(latitude > 38.94 & latitude < 39.13) & (longitude > -94.5769 & longitude < -94.5697)) |
(# Condition for INDEPENDENCE
(latitude > 39.1045 & latitude < 39.1158) & (longitude > -94.5639 & longitude < -94.5079)),
T,F))
bad_area$bad_name <- ifelse(
grepl("Prospect",bad_area$address)|
grepl("Independence",bad_area$address)|
grepl("Benton",bad_area$address)|
grepl("Troost",bad_area$address)|
grepl("Paseo",bad_area$address), T,F)
After creating indicator columns for observations that fall within our range of “bad” coordinates and for those that have “bad” names, we can see how the number of rows varies depending on which condition we use for defining a “bad” observation.
length(which(bad_area$bad_geo == bad_area$bad_name & bad_area$bad_name == TRUE))
## [1] 187
print("^^ # of rows where obs. determined by bad_geo AND bad_name")
## [1] "^^ # of rows where obs. determined by bad_geo AND bad_name"
length(which(bad_area$bad_name == TRUE))
## [1] 202
print("^^ # of rows where obs. determined by bad_name")
## [1] "^^ # of rows where obs. determined by bad_name"
length(which(bad_area$bad_geo == TRUE))
## [1] 653
print("^^ # of rows where obs. determined by bad_geo")
## [1] "^^ # of rows where obs. determined by bad_geo"
By plotting these different scenarios on a map, we can gain a better understanding of how the choice of approach would affect our end calculations.
# Creating mapview for each condition option, where condition of "bad_name" has
# already been shown
bad_area %>% filter(bad_geo == TRUE & bad_name == TRUE) %>%
mapview(., label = TRUE,
xcol = "longitude", ycol = "latitude",
crs = 4269, legend = T, cex=3, color = "black")
bad_area %>% filter(bad_geo == TRUE) %>%
mapview(., label = TRUE,
xcol = "longitude", ycol = "latitude",
crs = 4269, legend = T, cex=3, color = "red")
As we can see above, the general shape of the points on both plots is similar (due to the range limitations of the coordinates). Filtering by coordinates AND street name (shown with the purple points) produces a more conservative estimate.
While this conservative grouping might be truer to the original question of this analysis, we should recognize that criminal behavior doesn't necessarily follow a strict code of conduct (ya don't say?) regarding where violent crimes occur; a spillover effect could be present surrounding "dangerous streets," resulting in "dangerous areas" clustered around those streets. Therefore, the more generous grouping might provide a relevant perspective as well.
Now that the data has been explored, transformed, and visualized, we can think about the implications of our findings in terms of useful insight. Specifically, we could find a numeric answer to the question "How do our chances of encountering gun-related violence in KC change depending on where specifically in the city, or on which specific street, we are?"
To answer this question, we can think about the data, the city of KC, and our groupings in terms of square miles. Then we can establish a perimeter for our declaration of "zones" with specified ranges for the coordinates. After having formally divided the zones, we can compare rates of violence between different zones, as well as the difference in rates of gun violence between areas captured within a defined zone and areas not included in any defined zone (i.e. the rest).
Creating a data set for each zone
# The term 'money' refers to the coursework assignment upon which this analysis
# is based -- to provide a "moneyshot figure" estimate
ctrl_group = guns %>%
filter(state == "Missouri" & city == "Kansas City")%>%
filter(
latitude > 38.8759 & latitude < 39.1184 &
longitude > -94.60175 & longitude < -94.58
) %>%
mutate(merged_data = "B: Same Land Area as A") %>%
select(1:26,merged_data)
ctrl_group2 = guns %>%
filter(state == "Missouri" & city == "Kansas City")%>%
filter(longitude >= -94.58 & latitude <= 39) %>%
mutate(merged_data = "C: Greater Land Area than A") %>%
select(1:26,merged_data)
ctrl_group3 = guns %>%
filter(state == "Missouri" & city == "Kansas City")%>%
filter(longitude > -94.60175 & latitude > 39.13) %>%
mutate(merged_data = "D: Greater Land Area than A") %>%
select(1:26,merged_data)
ctrl_group4 = guns %>%
filter(state == "Missouri" & city == "Kansas City")%>%
filter(longitude > -94.54 & latitude < 39.05 &
latitude > 39 & longitude < -94.436) %>%
mutate(merged_data = "E: Same Land Area as A") %>%
select(1:26,merged_data)
ctrl_group5 = guns %>%
filter(state == "Missouri" & city == "Kansas City")%>%
filter(longitude > -94.54 & latitude >= 39.05 &
latitude < 39.105 & longitude < -94.436) %>%
mutate(merged_data = "F: Same Land Area as A") %>%
select(1:26,merged_data)
test_group = bad_area %>% filter(bad_geo == TRUE) %>%
filter(longitude >= -94.58 & latitude > 39) %>%
mutate(merged_data = "A: The Bad Area") %>%
select(1:26,merged_data)
rbind(ctrl_group, test_group, ctrl_group2,
ctrl_group3, ctrl_group4, ctrl_group5) -> money_df
mapview(money_df, label = TRUE,
xcol = "longitude", ycol = "latitude", crs = 4269,
zcol = "merged_data", legend = T)
While much of the existing literature attempts to define characteristics of cities in per capita terms, I thought "why not consider measurements of violence per unit of land area?" While one potential downside of this approach could be that the measure only captures absolute values (bigger cities with more people would be expected to have higher violence-per-land-unit rates), we can still derive some insightful conclusions. These calculations are shown below:
nc1 = nrow(ctrl_group)
nc2 = nrow(ctrl_group2)
nc3 = nrow(ctrl_group3)
nc4 = nrow(ctrl_group4)
nc5 = nrow(ctrl_group5)
nt = nrow(test_group)
df = as.data.frame(cbind(nc1, nc2, nc3, nc4, nc5, nt))
print(paste(sum(t(df)),"-- total # of observations across all zones"))
## [1] "1529 -- total # of observations across all zones"
paste(round((df$nc1)/1529,digits=4), "- Ctrl 1 as proportion of total n")
## [1] "0.1321 - Ctrl 1 as proportion of total n"
paste(round((df$nc4)/1529,digits=4), "- Ctrl 4 as proportion of total n")
## [1] "0.0785 - Ctrl 4 as proportion of total n"
paste(round((df$nc5)/1529,digits=4), "- Ctrl 5 as proportion of total n")
## [1] "0.1831 - Ctrl 5 as proportion of total n"
paste(round((df$nt)/1529,digits=4), "- BAD AREA as proportion of total n")
## [1] "0.3388 - BAD AREA as proportion of total n"
final_fig = as.data.frame(t(df)) %>%
rename("n as pct of total n_obs" = V1) %>%
mutate(rowid = seq(nrow(.))) %>%
filter(!(rowid %in% c(2, 3))) # drop Ctrl 2 and Ctrl 3 (different land areas)
By performing simple algebraic computations, we can find each zone's area by expressing the distances between points on the zone boundaries in miles rather than in degrees of latitude/longitude.
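As a sketch of that conversion (using the rough approximation that one degree of latitude spans about 69 miles, while a degree of longitude shrinks with the cosine of the latitude), the boundary values from the ctrl_group filter yield an area close to the 19.23 sq. mi. figure used in this analysis:

```r
# Approximate area of a latitude/longitude bounding box in square miles.
# Rough conversion: 1 degree latitude ~ 69 miles, and
# 1 degree longitude ~ 69 * cos(latitude) miles.
box_area_sq_mi <- function(lat_min, lat_max, lon_min, lon_max) {
  mid_lat   <- (lat_min + lat_max) / 2
  height_mi <- (lat_max - lat_min) * 69
  width_mi  <- (lon_max - lon_min) * 69 * cos(mid_lat * pi / 180)
  height_mi * width_mi
}

# Boundary values from the ctrl_group (Zone B) filter above:
box_area_sq_mi(38.8759, 39.1184, -94.60175, -94.58)
# ~19.5 sq. mi., in the same ballpark as the 19.23 used below
```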
paste(round((df$nc1)/19.23,digits=4), "- Ctrl 1: occurrences per sq. mi.")
## [1] "10.5044 - Ctrl 1: occurrences per sq. mi."
paste(round((df$nc4)/19.23,digits=4), "- Ctrl 4: occurrences per sq. mi.")
## [1] "6.2402 - Ctrl 4: occurrences per sq. mi."
paste(round((df$nc5)/19.23,digits=4), "- Ctrl 5: occurrences per sq. mi.")
## [1] "14.5606 - Ctrl 5: occurrences per sq. mi."
paste(round((df$nt)/19.23,digits=4), "- BAD AREA: occurrences per sq. mi.")
## [1] "26.9371 - BAD AREA: occurrences per sq. mi."
Now we can return to the full KC data set. Considering that the area in square miles of the entire city is approximately 319, we can perform similar calculations to get an estimated gun-violence rate from all areas of the city not included in the zones shown above.
# Getting list of values we DON'T want to include in calculations
rbind(ctrl_group, ctrl_group4, ctrl_group5, test_group) -> trash
dfg %>% filter(!(incidentid %in% trash$incidentid)) -> not_trash
# Total city sq.mi - zone area*n_zones for calculation
(nrow(not_trash))/(319-(19.3*4))
## [1] 3.957816
pct_change = function(x,y){
results = abs(((y-x)/x)*100)
return(results)
}
paste("There is a",
round(pct_change(3.957816, 10.5044),digits = 3),
"percent increase in the occurrence rate of incidents of gun-related violence in",
"ZONE B (ctrl-group1)",
"from the rate of an area outside our zones of interest")
## [1] "There is a 165.409 percent increase in the occurrence rate of incidents of gun-related violence in ZONE B (ctrl-group1) from the rate of an area outside our zones of interest"
paste("There is a",
round(pct_change(3.957816, 6.2402),digits = 3),
"percent increase in the occurrence rate of incidents of gun-related violence in",
"ZONE E (ctrl-group4)",
"from the rate of an area outside our zones of interest")
## [1] "There is a 57.668 percent increase in the occurrence rate of incidents of gun-related violence in ZONE E (ctrl-group4) from the rate of an area outside our zones of interest"
paste("There is a",
round(pct_change(3.957816, 14.5606),digits = 3),
"percent increase in the occurrence rate of incidents of gun-related violence in",
"ZONE F (ctrl-group5)",
"from the rate of an area outside our zones of interest")
## [1] "There is a 267.895 percent increase in the occurrence rate of incidents of gun-related violence in ZONE F (ctrl-group5) from the rate of an area outside our zones of interest"
Now, to formally express an answer to the questions that inspired this analysis:
paste("In Kansas City, Missouri, there is an estimated",
round(pct_change(3.957816, 26.9371),digits = 4),
"percent increase in the occurrence rate of gun-related violent incidents in",
"the BAD AREA compared to the violence rate from the average of zones outside our range of interest -",
"where the BAD AREA is composed of the streets from our `most-dangerous` list and the",
"nearby geographic area to capture potential spillover effects and aid in simplicity",
"of calculations.")
## [1] "In Kansas City, Missouri, there is an estimated 580.6052 percent increase in the occurrence rate of gun-related violent incidents in the BAD AREA compared to the violence rate from the average of zones outside our range of interest - where the BAD AREA is composed of the streets from our `most-dangerous` list and the nearby geographic area to capture potential spillover effects and aid in simplicity of calculations."
While there are a number of different approaches we could have taken in an attempt to answer our questions in this analysis (or perhaps even different topics to explore and different questions to ask), the data seems to suggest that certain areas of Kansas City experience higher rates of gun-related violent incidents than other areas, and that these groupings can be strategically/quantitatively recognized.
In future approaches to this topic, it would be interesting to model the data using methods such as K-NN to algorithmically determine how zones could/ought to be drawn, allowing for more interesting (and perhaps more insightful) conclusions than can be reached by manually defining our zones/ranges of interest with linear boundaries.
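As a minimal sketch of that direction (using base R's kmeans() clustering as a stand-in for the nearest-neighbor idea, and synthetic coordinates in place of the real data), zones could be proposed algorithmically rather than drawn by hand:

```r
# Hypothetical sketch: cluster incident coordinates into candidate "zones".
# Synthetic points stand in for the real latitude/longitude columns, and
# the number of centers is chosen arbitrarily here and would need tuning.
set.seed(42)
fake_incidents <- data.frame(
  latitude  = c(rnorm(50, 39.05, 0.01), rnorm(50, 39.10, 0.01)),
  longitude = c(rnorm(50, -94.55, 0.01), rnorm(50, -94.58, 0.01))
)
zones <- kmeans(fake_incidents[, c("latitude", "longitude")], centers = 2)
# Each incident receives a zone label; the cluster centers suggest
# data-driven zone midpoints instead of manually chosen boundaries.
table(zones$cluster)
zones$centers
```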
Another approach to explore in the future would be to perform the same general workflow as in this analysis, except more generalized: constructing functions that would allow any city to be analyzed, rather than only one.
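A hedged sketch of what that generalization might look like, wrapping the filtering and de-duplication steps from the start of this analysis in a function (the state, city, and incidentid column names are assumed to match this data set):

```r
# Hypothetical generalization of the KCMO filtering chunk: the same
# workflow for any state/city pair. Assumes `state`, `city`, and
# `incidentid` columns, as in the Gun Violence Archive extract used here.
library(dplyr)

filter_city <- function(gun_data, target_state, target_city) {
  gun_data %>%
    filter(state == target_state & city == target_city) %>%
    mutate(dup_id = duplicated(incidentid)) %>%
    filter(dup_id == FALSE)
}

# e.g. dfg <- filter_city(guns, "Missouri", "Kansas City")
```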
If we had data that included information on population of the city, we could use that additional information to strengthen the findings from this crowd-sourced data set. The image below depicts the population density of KCMO.
I found it interesting that the area along the KS-MO state line just south of the river appears to have a high population density, yet we know that geographic area experiences a lower rate of gun-related violent crime. Perhaps this supports the conclusion we were testing in this analysis: if all other factors are the same (population density, etc.), then certain streets are associated with higher rates of crime than other areas.
We can also see that the "gaps" in the mapped values of our data coincide with the separate municipalities surrounded by the KC city limits. It is also worth noting that in our final figures, the values represented are not filtered by time and span the full time range of the original data. Perhaps there are trends that would appear when looking at the data as a time series, where the "most-dangerous" streets/areas of 2016 might be much safer in 2020 (or vice versa).
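A minimal sketch of that time-series angle (the incident_date column name is hypothetical -- the actual date field in the archive may be named differently):

```r
# Hypothetical sketch: count incidents per year and street. The
# `incident_date` column is an assumed name for the archive's date field.
library(dplyr)

yearly_counts <- function(df) {
  df %>%
    mutate(year = format(as.Date(incident_date), "%Y")) %>%
    count(year, street_name)
}

# e.g. yearly_counts(asdf) would show whether a street's incident count
# shifts between, say, 2016 and 2020.
```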
For personal takeaways, I thought it was interesting to model data from the city where I was born and raised and to make an attempt at quantifying its reputations. I also found it enjoyable/insightful to learn more about spatial modeling of data, specifically the wide range of applications possible with latitudinal and longitudinal information.
For general takeaways from this analysis, perhaps it would be a good idea to be extra vigilant of my street bearings and longitudinal/latitudinal position the next time I find myself walking the streets of the Kansas City MSA, lest I find myself as a statistic in a set of crowd-sourced data being used in the pursuit of higher education.